Introduction

This graphical summary contains the following sections:

  • Preprocessing
  • Topic modeling
  • Document clustering across topic hierarchy
  • Topic hierarchy visualization

Preprocessing

Steps

  • Full texts were collected into a local Zotero library using the Zotero connector.
  • Text was extracted using PyPDF2.
  • Punctuation was removed using nltk and regular expressions.
  • Texts were tokenized using nltk.
  • Tokens that occurred in a document fewer than 3 or more than 950 times were removed, as suggested in Khodorchenko, M. et al. (2020).
  • Additionally, tokens in which a single two-character combination made up more than 50% of the token length were removed.
    • Manual inspection of the pre-processed dataset showed that this step helps to reduce the number of uninformative tokens
      (see also Figures 1 and 2).
  • Stop words were removed using the stop word list from nltk, extended to further reduce the number of uninformative terms.
  • Co-occurrence counts were calculated for both datasets using a custom Python script with a window size of 10 tokens.
  • The corpus and co-occurrence counts were saved in a format accepted by the BigARTM library and used to construct hierarchical topic models.
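The window-based co-occurrence counting step can be sketched as follows (a minimal illustration, not the actual script; the function name and pair-ordering convention are assumptions):

```python
from collections import Counter

def cooccurrence_counts(tokens, window=10):
    """Count co-occurring token pairs within a sliding window of `window`
    tokens. Pairs are stored in sorted order so (a, b) and (b, a) merge."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        # look at the next window-1 tokens, i.e. a window of `window` total
        for other in tokens[i + 1:i + window]:
            if tok != other:
                counts[tuple(sorted((tok, other)))] += 1
    return counts

counts = cooccurrence_counts(["read", "genome", "read", "alignment"], window=3)
# the pair (genome, read) appears twice within the 3-token windows
```

In practice these counts would then be serialized into BigARTM's batch/dictionary format; that conversion is omitted here.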

Comments on grid plot structure

  • The grid plots show token counts by the maximum observed fraction of token length
    made up of a single two-character combination.
  • Each row shows the result of splitting the dataset at a given fraction-of-length threshold.
  • The first column shows the counts for tokens above the threshold (termed "Noise" here).
  • The second column shows the counts for tokens below the threshold (termed "Clean" here).
  • The third column shows a word cloud of the "Noise" tokens.
  • The fourth column shows a word cloud of the "Clean" tokens.
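The dimer-fraction metric used to split tokens into "Noise" and "Clean" can be sketched as follows (a minimal illustration; the function name is hypothetical, and non-overlapping occurrences are counted, which is sufficient for a filter):

```python
def max_dimer_fraction(token: str) -> float:
    """Return the largest fraction of the token's length covered by
    repeats of any single two-character combination."""
    if len(token) < 2:
        return 0.0
    best = 0.0
    for i in range(len(token) - 1):
        pair = token[i:i + 2]
        count = token.count(pair)  # non-overlapping occurrences
        best = max(best, 2 * count / len(token))
    return best

noise = max_dimer_fraction("aaaaaa")   # "aa" covers the whole token
clean = max_dimer_fraction("genome")   # no dimer dominates
```

Tokens whose dimer fraction exceeds the 50% threshold would be discarded as "Noise"; the rest are kept as "Clean".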

Figure 1: Repeated dimer filtering - natural language processing dataset

Figure 2: Repeated dimer filtering - bioinformatics dataset

Topic modeling

Document clustering across topic hierarchy

  • The hierarchical topic model based on Chirkova, N.A. (2016) makes it possible to calculate a topic distribution for each document in the corpus.
  • Each such vector corresponds to a discrete probability distribution and can be used to compare documents at a given level of the topic hierarchy, similarly to neural network embeddings.
  • Additionally, for each level of the hierarchy except the first, it is possible to obtain vectors representing super-topics in terms of sub-topics. This allows super-topics (at the higher level of the hierarchy) to be treated as pseudo-documents and included in the document matrix; this is the basis of the approach for building the topic hierarchy described by Chirkova, N.A. (2016).
  • Calculating the Hellinger distance between such vectors (documents and pseudo-documents) was suggested as one of the quality metrics for the model in the original publication.
  • Here, this approach was generalized by implementing document similarity calculation in three additional steps:
    • First, given a matrix of topic-based document distributions (termed Phi) and a matrix of pseudo-document distributions (termed Psi), a combined matrix is generated.
    • Next, the square pairwise distance matrix is calculated from the combined matrix using the Hellinger distance formula.
    • Finally, the distance matrix is converted to a similarity matrix using the Bhattacharyya coefficient, as discussed in Kitsos, C.P. and Nisiotis, C.-S. (2022).
  • Using the resulting similarity matrix, spectral clustering of the documents was performed to assign groups of topic-based similarity within each level of the topic hierarchy.
  • Documents at each level of the hierarchy were also represented visually in two-dimensional space by applying multidimensional scaling to the original distance matrix to generate a scatter plot.
  • This plot was additionally annotated with document labels and with connections between documents whose pairwise Hellinger distance was below the specified threshold. The resulting plots are shown in Figures 4 and 7.
  • A human-in-the-loop evaluation of actual document similarity within the discovered groups is yet to be conducted.
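The distance, similarity, clustering, and embedding steps above can be sketched as follows (a minimal sketch assuming NumPy and scikit-learn; the toy matrix and function name are hypothetical, and the real input would be the combined document / pseudo-document distribution matrix):

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.manifold import MDS

def hellinger_matrix(P):
    """Pairwise Hellinger distances between rows of P, where each row is
    a discrete probability distribution:
    H(p, q) = (1 / sqrt(2)) * || sqrt(p) - sqrt(q) ||_2"""
    R = np.sqrt(P)
    diff = R[:, None, :] - R[None, :, :]
    return np.linalg.norm(diff, axis=-1) / np.sqrt(2)

# Toy document-topic matrix (rows sum to 1): two clear groups.
P = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])

D = hellinger_matrix(P)
S = 1.0 - D ** 2  # Bhattacharyya coefficient: BC = 1 - H^2

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(S)
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(D)
```

The identity BC = 1 - H^2 follows directly from the definition of the Hellinger distance, which is why the similarity matrix can be derived from the distance matrix without recomputing anything from the distributions.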

Topic hierarchy visualization

  • To show the resulting hierarchy and the connections discovered by the model, an additional function was developed to represent all levels of the topic hierarchy and include every connection between layers with a model-assigned probability above a specified threshold.
  • The resulting plots are shown in figures 5 and 8.
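The edge-selection logic behind this visualization can be sketched as follows (a minimal sketch under the assumption that links are kept when their probability exceeds the threshold; the function name, toy matrix, and topic labels are hypothetical):

```python
def hierarchy_edges(psi, parent_names, child_names, threshold=0.1):
    """Select parent->child links whose model-assigned probability
    exceeds the threshold. psi[i][j] ~ p(child_j | parent_i)."""
    edges = []
    for i, parent in enumerate(parent_names):
        for j, child in enumerate(child_names):
            if psi[i][j] > threshold:
                edges.append((parent, child, psi[i][j]))
    return edges

# Toy inter-level probability matrix: 2 super-topics, 3 sub-topics.
psi = [[0.7, 0.25, 0.05],
       [0.1, 0.2, 0.7]]
edges = hierarchy_edges(psi, ["topic_0", "topic_1"],
                        ["sub_0", "sub_1", "sub_2"], threshold=0.15)
```

The selected edges would then be drawn between the corresponding nodes at adjacent levels of the hierarchy plot.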

Results for BIOIT set

CPU times: user 18.2 s, sys: 2.96 s, total: 21.1 s
Wall time: 13.5 s
level0

topic_0:  ['data', 'sequencing', 'analysis', 'cells', 'cell', 'dna', 'cancer', 'methods', 'used', 'gene']
topic_1:  ['reads', 'genome', 'read', 'data', 'alignment', 'reference', 'variant', 'sequencing', 'genomes', 'coverage']

level1

topic_0:  ['sequencing', 'dna', 'cancer', 'resistance', 'gene', 'ngs', 'genes', 'using', 'detection', 'protein']
topic_1:  ['variant', 'kraken', 'variants', 'regions', 'normalization', 'benchmark', 'species', 'scone', 'snps', 'wgs']
topic_2:  ['reads', 'genome', 'read', 'alignment', 'assembly', 'coverage', 'genomes', 'reference', 'contigs', 'graph']
topic_3:  ['data', 'analysis', 'cell', 'cells', 'methods', 'metagenomic', 'used', 'expression', 'nat', 'metagenomics']

level2

topic_0:  ['alignment', 'bioinformatics', 'tools', 'algorithms', 'umap', 'mapping', 'short', 'length', 'algorithm', 'fuzzy']
topic_1:  ['genomes', 'contigs', 'lineage', 'contig', 'tree', 'assemblies', 'supplementary', 'assigned', 'samples', 'grapetree']
topic_2:  ['cell', 'cells', 'methods', 'expression', 'clustering', 'number', 'dataset', 'model', 'clusters', 'scvis']
topic_3:  ['variant', 'variants', 'ngs', 'wgs', 'normalization', 'scone', 'calling', 'depth', 'performance', 'lrs']
topic_4:  ['mash', 'sketch', 'aligner', 'fastp', 'hash', 'mappings', 'quality', 'size', 'mapq', 'file']
topic_5:  ['species', 'learning', 'tumor', 'detection', 'deep', 'genomics', 'liquid', 'circulating', 'pubmed', 'patients']
topic_6:  ['metagenomic', 'metagenomics', 'microbiome', 'args', 'benchmark', 'regions', 'resistance', 'microbial', 'usa', 'pubmed']
topic_7:  ['assembly', 'coverage', 'graph', 'set', 'illumina', 'distance', 'bias', 'forensic', 'ajb', 'ion']

level3

topic_0:  ['variant', 'read', 'genome', 'aligner', 'resfinder', 'mappings', 'resistance', 'mapq', 'reference', 'reads']
topic_1:  ['genomes', 'assembly', 'reads', 'using', 'genome', 'mash', 'benchmark', 'regions', 'variants', 'coverage']
topic_2:  ['data', 'clustering', 'cell', 'cells', 'normalization', 'genes', 'methods', 'expression', 'number', 'used']
topic_3:  ['data', 'coverage', 'learning', 'genome', 'deep', 'bias', 'sequencing', 'genomics', 'illumina', 'human']
topic_4:  ['read', 'alignment', 'reads', 'genome', 'reference', 'sequencing', 'algorithms', 'bioinformatics', 'dna', 'scone']
topic_5:  ['sequencing', 'species', 'data', 'wgs', 'reads', 'lrs', 'variant', 'depth', 'using', 'kraken']
topic_6:  ['umap', 'data', 'usa', 'university', 'fuzzy', 'author', 'set', 'qiime', 'manuscript', 'manifold']
topic_7:  ['analysis', 'data', 'cell', 'cells', 'methods', 'sequencing', 'gatk', 'expression', 'gene', 'genome']
topic_8:  ['snps', 'drosophila', 'variant', 'amino', 'megares', 'gene', 'snpeff', 'variants', 'args', 'protein']
topic_9:  ['cells', 'data', 'args', 'resistance', 'scvis', 'resistome', 'cell', 'dataset', 'bipolar', 'clusters']
topic_10:  ['cancer', 'dna', 'data', 'sequencing', 'analysis', 'tumor', 'detection', 'pubmed', 'liquid', 'circulating']
topic_11:  ['reads', 'graph', 'assembly', 'ajb', 'contigs', 'bruijn', 'distance', 'spades', 'genome', 'edge']
topic_12:  ['metagenomic', 'analysis', 'data', 'sequencing', 'metagenomics', 'microbiome', 'used', 'microbial', 'dna', 'pubmed']
topic_13:  ['data', 'nat', 'methods', 'integration', 'analysis', 'cell', 'reduction', 'sequencing', 'cells', 'joint']
topic_14:  ['sequencing', 'dna', 'ngs', 'cancer', 'analysis', 'data', 'genome', 'forensic', 'technology', 'variant']
topic_15:  ['kraken', 'reference', 'data', 'sequences', 'genes', 'used', 'lineage', 'sequence', 'genomes', 'genome']

Figure 3. Quality metrics across training iterations for the hierarchical model - bioinformatics dataset

Results for level 0

Sparsity Phi: 0.381 
Sparsity Theta: 0.000
Kernel contrast: 0.891
Kernel purity: 0.938
Results for level 1

Sparsity Phi: 0.567 
Sparsity Theta: 0.000
Kernel contrast: 0.846
Kernel purity: 0.890
Results for level 2

Sparsity Phi: 0.735 
Sparsity Theta: 0.009
Kernel contrast: 0.839
Kernel purity: 0.879
Results for level 3

Sparsity Phi: 0.000 
Sparsity Theta: 0.000
Kernel contrast: 0.461
Kernel purity: 0.388

Figure 4. Spectral clustering results - bioinformatics dataset

Figure 5. Topic hierarchy structure - bioinformatics dataset

Results for NLP set

CPU times: user 44.4 s, sys: 11.9 s, total: 56.3 s
Wall time: 23.9 s
level0

topic_0:  ['data', 'network', 'social', 'used', 'learning', 'spam', 'information', 'embedding', 'networks', 'research']
topic_1:  ['model', 'proceedings', 'conference', 'information', 'learning', 'knowledge', 'text', 'language', 'data', 'methods']

level1

topic_0:  ['data', 'network', 'embedding', 'information', 'news', 'graph', 'clustering', 'learning', 'models', 'networks']
topic_1:  ['clinical', 'social', 'patent', 'text', 'model', 'classification', 'data', 'spam', 'learning', 'detection']
topic_2:  ['knowledge', 'model', 'articles', 'tax', 'used', 'training', 'disease', 'set', 'research', 'quality']
topic_3:  ['proceedings', 'conference', 'language', 'extraction', 'computational', 'knowledge', 'association', 'linguistics', 'learning', 'methods']

level2

topic_0:  ['clustering', 'areas', 'topic', 'institutions', 'recommendation', 'terms', 'thematic', 'technology', 'performance', 'recommendations']
topic_1:  ['model', 'text', 'information', 'used', 'models', 'using', 'clinical', 'conference', 'classification', 'language']
topic_2:  ['event', 'sket', 'reports', 'pathology', 'engineering', 'events', 'argument', 'cancer', 'topics', 'archetype']
topic_3:  ['articles', 'example', 'dream', 'citation', 'training', 'article', 'prefiltering', 'traf', 'trial', 'sampling']
topic_4:  ['knowledge', 'proceedings', 'extraction', 'conference', 'computational', 'language', 'methods', 'entity', 'relation', 'concept']
topic_5:  ['social', 'spam', 'tax', 'detection', 'features', 'twitter', 'techniques', 'users', 'accounts', 'cases']
topic_6:  ['patent', 'questions', 'question', 'problem', 'patents', 'modeling', 'class', 'study', 'problems', 'classification']
topic_7:  ['network', 'networks', 'graph', 'embedding', 'nodes', 'node', 'disease', 'representation', 'gcn', 'drug']

level3

topic_0:  ['areas', 'institutions', 'recommendation', 'thematic', 'recommendations', 'system', 'set', 'collaboration', 'technology', 'institution']
topic_1:  ['example', 'dream', 'house', 'dreams', 'situation', 'reports', 'flying', 'situations', 'falling', 'groups']
topic_2:  ['political', 'model', 'text', 'classification', 'detection', 'work', 'seed', 'label', 'data', 'policy']
topic_3:  ['construction', 'research', 'data', 'text', 'argument', 'analysis', 'nlp', 'documents', 'media', 'mining']
topic_4:  ['proceedings', 'conference', 'extraction', 'learning', 'computational', 'information', 'language', 'association', 'word', 'methods']
topic_5:  ['data', 'news', 'clustering', 'model', 'set', 'methods', 'online', 'models', 'patent', 'problem']
topic_6:  ['data', 'clinical', 'clustering', 'trial', 'emr', 'patient', 'patients', 'vector', 'medical', 'trials']
topic_7:  ['patent', 'question', 'questions', 'word', 'words', 'summarization', 'based', 'model', 'information', 'data']
topic_8:  ['knowledge', 'entity', 'resolution', 'subjectivity', 'methods', 'tax', 'concept', 'entities', 'semantic', 'anaphora']
topic_9:  ['clinical', 'knowledge', 'concept', 'argumentative', 'literature', 'inform', 'mining', 'med', 'disease', 'learning']
topic_10:  ['model', 'articles', 'data', 'topic', 'training', 'citation', 'article', 'used', 'research', 'topics']
topic_11:  ['model', 'models', 'medical', 'clinical', 'bert', 'text', 'biomedical', 'classification', 'language', 'embeddings']
topic_12:  ['learning', 'network', 'embedding', 'graph', 'networks', 'node', 'nodes', 'data', 'information', 'representation']
topic_13:  ['event', 'social', 'lockdown', 'class', 'ratio', 'learning', 'data', 'media', 'events', 'distancing']
topic_14:  ['spam', 'social', 'detection', 'features', 'patent', 'classification', 'learning', 'dataset', 'used', 'text']
topic_15:  ['sket', 'reports', 'pathology', 'data', 'disease', 'cancer', 'network', 'concepts', 'networks', 'fication']

Figure 6. Quality metrics across training iterations for the hierarchical model - NLP dataset

Results for level 0

Sparsity Phi: 0.373 
Sparsity Theta: 0.000
Kernel contrast: 0.871
Kernel purity: 0.914
Results for level 1

Sparsity Phi: 0.570 
Sparsity Theta: 0.000
Kernel contrast: 0.827
Kernel purity: 0.780
Results for level 2

Sparsity Phi: 0.686 
Sparsity Theta: 0.000
Kernel contrast: 0.816
Kernel purity: 0.787
Results for level 3

Sparsity Phi: 0.000 
Sparsity Theta: 0.000
Kernel contrast: 0.471
Kernel purity: 0.402

Figure 7. Spectral clustering results - NLP dataset

Figure 8. Topic hierarchy structure - NLP dataset